Overview of Data Frame Operations

Data frames are the workhorse of R, so in this lecture we will basically be creating a "cheat sheet" of common operations used with data frames and R. This will be a very useful lecture, and while we will be going over material we've already covered its vital to know data frames very well in order to save yourself tons of time in the future!

We're going to do an overview of the following common actions:

  • Creating Data Frames
  • Importing and Exporting Data
  • Getting Information about Data Frame
  • Referencing Cells
  • Referencing Rows
  • Referencing Columns
  • Adding Rows
  • Adding Columns
  • Setting Column Names
  • Selecting Multiple Rows
  • Selecting Multiple Columns
  • Dealing with Missing Data

Creating Data Frames

In [79]:
empty <- data.frame() # empty data frame

c1 <- 1:10 # vector of integers

c2 <- letters[1:10] # vector of strings

df <- data.frame(col.name.1=c1,col.name.2=c2)
In [59]:
df
Out[59]:
col_name_1col_name_2
11a
22b
33c
44d
55e
66f
77g
88h
99i
1010j

Importing and Exporting Data

In [ ]:
d2 <- read.csv('some.file.name.csv')

# For Excel Files
# Load the readxl package
library(readxl)
# Call info from the sheets using read.excel
df <- read_excel('Sample-Sales-Data.xlsx',sheet='Sheet1')

# Output to csv
write.csv(df, file='some.file.csv')

Getting Information about Data Frame

In [6]:
# Row and columns counts
nrow(df)
ncol(df)
Out[6]:
10
Out[6]:
2
In [7]:
# Column Names
colnames(df)
Out[7]:
  1. 'col_name_1'
  2. 'col_name_2'
In [9]:
# Row names (may just return index)
rownames(df)
Out[9]:
  1. '1'
  2. '2'
  3. '3'
  4. '4'
  5. '5'
  6. '6'
  7. '7'
  8. '8'
  9. '9'
  10. '10'

Referencing Cells

You can think of the basics as using two sets of brackets for a single cell, and a single set of brackets for multiple cells. For example:

In [32]:
vec <- df[[5, 2]] # get cell by [[row,col]] num

newdf <- df[1:5, 1:2] # get multiplt cells in new df

df[[2, 'col.name.1']] <- 99999 # reassign a single cell
In [33]:
df
Out[33]:
col_name_1col_name_2
11a
299999b
33c
44d
55e
66f
77g
88h
99i
1010j

Referencing Rows

Usually you'll use the [row,] format

In [43]:
# returns a data frame (and not a vector!)
rowdf <- df[1, ]
In [44]:
rowdf
Out[44]:
col_name_1col_name_2
11a
In [49]:
# to get a row as a vector, use following
vrow <- as.numeric(as.vector(df[1,]))
In [50]:
vrow
Out[50]:
  1. 1
  2. 1

Referencing Columns

Most column references return a vector:

In [62]:
cars <- mtcars
head(cars)
Out[62]:
mpgcyldisphpdratwtqsecvsamgearcarb
Mazda RX42161601103.92.6216.460144
Mazda RX4 Wag2161601103.92.87517.020144
Datsun 71022.84108933.852.3218.611141
Hornet 4 Drive21.462581103.083.21519.441031
Hornet Sportabout18.783601753.153.4417.020032
Valiant18.162251052.763.4620.221031
In [63]:
colv1 <- cars$mpg # returns a vector
colv1

colv2 <- cars[, 'mpg'] # returns vector
colv2

colv3<- cars[, 1] # a is int or string
colv3

colv4 <- cars[['mpg']] # returns a vector
colv4
Out[63]:
  1. 21
  2. 21
  3. 22.8
  4. 21.4
  5. 18.7
  6. 18.1
  7. 14.3
  8. 24.4
  9. 22.8
  10. 19.2
  11. 17.8
  12. 16.4
  13. 17.3
  14. 15.2
  15. 10.4
  16. 10.4
  17. 14.7
  18. 32.4
  19. 30.4
  20. 33.9
  21. 21.5
  22. 15.5
  23. 15.2
  24. 13.3
  25. 19.2
  26. 27.3
  27. 26
  28. 30.4
  29. 15.8
  30. 19.7
  31. 15
  32. 21.4
Out[63]:
  1. 21
  2. 21
  3. 22.8
  4. 21.4
  5. 18.7
  6. 18.1
  7. 14.3
  8. 24.4
  9. 22.8
  10. 19.2
  11. 17.8
  12. 16.4
  13. 17.3
  14. 15.2
  15. 10.4
  16. 10.4
  17. 14.7
  18. 32.4
  19. 30.4
  20. 33.9
  21. 21.5
  22. 15.5
  23. 15.2
  24. 13.3
  25. 19.2
  26. 27.3
  27. 26
  28. 30.4
  29. 15.8
  30. 19.7
  31. 15
  32. 21.4
Out[63]:
  1. 21
  2. 21
  3. 22.8
  4. 21.4
  5. 18.7
  6. 18.1
  7. 14.3
  8. 24.4
  9. 22.8
  10. 19.2
  11. 17.8
  12. 16.4
  13. 17.3
  14. 15.2
  15. 10.4
  16. 10.4
  17. 14.7
  18. 32.4
  19. 30.4
  20. 33.9
  21. 21.5
  22. 15.5
  23. 15.2
  24. 13.3
  25. 19.2
  26. 27.3
  27. 26
  28. 30.4
  29. 15.8
  30. 19.7
  31. 15
  32. 21.4
Out[63]:
  1. 21
  2. 21
  3. 22.8
  4. 21.4
  5. 18.7
  6. 18.1
  7. 14.3
  8. 24.4
  9. 22.8
  10. 19.2
  11. 17.8
  12. 16.4
  13. 17.3
  14. 15.2
  15. 10.4
  16. 10.4
  17. 14.7
  18. 32.4
  19. 30.4
  20. 33.9
  21. 21.5
  22. 15.5
  23. 15.2
  24. 13.3
  25. 19.2
  26. 27.3
  27. 26
  28. 30.4
  29. 15.8
  30. 19.7
  31. 15
  32. 21.4
In [64]:
# Ways of Returning Data Frames
mpgdf <- cars['mpg'] # returns 1 col df
head(mpgdf)

mpgdf2 <- cars[1] # returns 1 col df
head(mpgdf2)
Out[64]:
mpg
Mazda RX421
Mazda RX4 Wag21
Datsun 71022.8
Hornet 4 Drive21.4
Hornet Sportabout18.7
Valiant18.1
Out[64]:
mpg
Mazda RX421
Mazda RX4 Wag21
Datsun 71022.8
Hornet 4 Drive21.4
Hornet Sportabout18.7
Valiant18.1

Adding Rows

In [75]:
# Both arguments are DFs)
df2 <- data.frame(col.name.1=2000,col.name.2='new' )
df2

# use rbind to bind a new row!
dfnew <- rbind(df,df2)
Out[75]:
col_name_1col_name_2
12000new
In [76]:
dfnew
Out[76]:
col_name_1col_name_2
11a
22b
33c
44d
55e
66f
77g
88h
99i
1010j
112000new

Adding Columns

In [80]:
df$newcol <- rep(NA, nrow(df)) # NA column
df
Out[80]:
col_name_1col_name_2newcol
11aNA
22bNA
33cNA
44dNA
55eNA
66fNA
77gNA
88hNA
99iNA
1010jNA
In [82]:
df[, 'copy.of.col2'] <- df$col.name.2 # copy a col
df
Out[82]:
col_name_1col_name_2newcolcopy_of_col2
11aNAa
22bNAb
33cNAc
44dNAd
55eNAe
66fNAf
77gNAg
88hNAh
99iNAi
1010jNAj
In [83]:
# Can also use equations!
df[['col1.times.2']] <- df$col.name.1 * 2
df
Out[83]:
col_name_1col_name_2newcolcopy_of_col2col1_times_2
11aNAa2
22bNAb4
33cNAc6
44dNAd8
55eNAe10
66fNAf12
77gNAg14
88hNAh16
99iNAi18
1010jNAj20
In [84]:
df3 <- cbind(df, df$col.name.1)
df3
Out[84]:
col_name_1col_name_2newcolcopy_of_col2col1_times_2df$col_name_1
11aNAa21
22bNAb42
33cNAc63
44dNAd84
55eNAe105
66fNAf126
77gNAg147
88hNAh168
99iNAi189
1010jNAj2010

Setting Column Names

In [86]:
# Rename second column
colnames(df)[2] <- 'SECOND COLUMN NEW NAME'
df

# Rename all at once with a vector
colnames(df) <- c('col.name.1', 'col.name.2', 'newcol', 'copy.of.col2' ,'col1.times.2')
df
Out[86]:
col_name_1SECOND COLUMN NEW NAMEnewcolcopy_of_col2col1_times_2
11aNAa2
22bNAb4
33cNAc6
44dNAd8
55eNAe10
66fNAf12
77gNAg14
88hNAh16
99iNAi18
1010jNAj20
Out[86]:
col_name_1col_name_2newcolcopy_of_col2col1_times_2
11aNAa2
22bNAb4
33cNAc6
44dNAd8
55eNAe10
66fNAf12
77gNAg14
88hNAh16
99iNAi18
1010jNAj20

Selecting Multiple Rows

In [87]:
first.ten.rows <- df[1:10, ] # Same as head(df, 10)
first.ten.rows
Out[87]:
col_name_1col_name_2newcolcopy_of_col2col1_times_2
11aNAa2
22bNAb4
33cNAc6
44dNAd8
55eNAe10
66fNAf12
77gNAg14
88hNAh16
99iNAi18
1010jNAj20
In [88]:
everything.but.row.two <- df[-2, ]
everything.but.row.two
Out[88]:
col_name_1col_name_2newcolcopy_of_col2col1_times_2
11aNAa2
33cNAc6
44dNAd8
55eNAe10
66fNAf12
77gNAg14
88hNAh16
99iNAi18
1010jNAj20
In [91]:
# Conditional Selection
sub1 <- df[ (df$col.name.1 > 8 & df$col1.times.2 > 10), ]
sub1

sub2 <- subset(df, col.name.1 > 8 & col1.times.2 > 10)
sub2
Out[91]:
col_name_1col_name_2newcolcopy_of_col2col1_times_2
99iNAi18
1010jNAj20
Out[91]:
col_name_1col_name_2newcolcopy_of_col2col1_times_2
99iNAi18
1010jNAj20

Selecting Multiple Columns

In [92]:
df[, c(1, 2, 3)] #Grab cols 1 2 3
Out[92]:
col_name_1col_name_2newcol
11aNA
22bNA
33cNA
44dNA
55eNA
66fNA
77gNA
88hNA
99iNA
1010jNA
In [93]:
df[, c('col.name.1', 'col1.times.2')] # by name
Out[93]:
col_name_1col1_times_2
112
224
336
448
5510
6612
7714
8816
9918
101020
In [94]:
df[, -1] # keep all but first column
Out[94]:
col_name_2newcolcopy_of_col2col1_times_2
1aNAa2
2bNAb4
3cNAc6
4dNAd8
5eNAe10
6fNAf12
7gNAg14
8hNAh16
9iNAi18
10jNAj20
In [95]:
df[, -c(1, 3)] # drop cols 1 and 3
Out[95]:
col_name_2copy_of_col2col1_times_2
1aa2
2bb4
3cc6
4dd8
5ee10
6ff12
7gg14
8hh16
9ii18
10jj20

Dealing with Missing Data

Dealing with missing data is a very important skill to know when working with data frames!

In [97]:
any(is.na(df)) # detect anywhere in df
Out[97]:
FALSE
In [98]:
any(is.na(df$col.name.1)) # anywhere in col
Out[98]:
FALSE
In [101]:
# delete selected missing data rows
df <- df[!is.na(df$col), ]
In [102]:
# replace NAs with something else
df[is.na(df)] <- 0 # works on whole df
In [103]:
df$col[is.na(df$col)] <- 999 # For a selected column

All done!

Hope this helps! Remember to use this as a reference during your exercises and projects!